Briefings in Bioinformatics

Oxford University Press (OUP)

Preprints posted in the last 90 days, ranked by how well they match Briefings in Bioinformatics's content profile, based on 11 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is already an above-average fit.

1
A rapid ONT-based sequencing approach to capture complete Ataxia-Mutomes (AtaxiaMutSeq)

De, T.; Faruq, M.

2025-12-19 genetic and genomic medicine 10.64898/2025.12.18.25342589
Top 0.2%
56× avg

Hereditary ataxias are complex neurological disorders with enormous genetic heterogeneity and diverse genetic mechanisms. Among these mechanisms, tandem nucleotide repeat expansions (TNRex) are the most common cause of genetic ataxias, followed by single nucleotide variations (SNVs) in over 200 genes. In clinics and laboratories, detection and diagnosis of tandem nucleotide repeats has by and large been more common than that of SNVs, owing to the large number of mutations in the respective genes in which they are found. The widely used platforms for detecting these mutations are capillary electrophoresis and next-generation sequencing-based targeted gene panels or clinical or whole exome sequencing. Long-read sequencers have proven useful for detecting tandem nucleotide repeat expansions. We have developed a method to detect TNRex and SNVs in one experiment and on a single platform, using an adaptive sequencing approach on Oxford Nanopore Technologies. We were able to optimize the target region sequencing of both TNR loci and SNV loci and validate the capture of both by detection of FXN-GAA repeats and pathogenic SNVs in SETX.

2
Survival risk heterogeneity among patients with NSCLC receiving nivolumab visualized by risk scores generated from deep learning method DeepSurv using tumor gene mutations

Nishiyama, N.

2026-02-22 oncology 10.64898/2026.02.15.26346303
Top 0.3%
39× avg

Immunotherapy with immune checkpoint inhibitors and immunotherapy combined with chemotherapy have represented promising treatments for NSCLC patients leading to prolonged survival. However, the majority of patients with advanced NSCLC have a poor prognosis. The identification and development of biomarkers for stratifying responders and non responders to immune checkpoint inhibitors contribute to unraveling the mechanism of the immune checkpoint pathway and the immune tumor interaction underlying the responses and are urgently needed to improve clinical outcomes of immune checkpoint inhibitor treatment. In this study, we analyzed the clinical and gene mutation data of NSCLC patients treated with nivolumab containing immunotherapy or nivolumab containing immunotherapy combined with chemotherapy (the immunotherapy treated group, n=119) and chemotherapy alone (the chemotherapy alone treated group, n=991) extracted from the MSK CHORD dataset. A DeepSurv model, a deep learning based extension of the Cox proportional hazards model, was trained to generate a survival risk score for each patient, with the binary statuses of thirty one gene mutations as input features. The thirty one genes were selected based on population level mutation frequency, patient level variance in mutation status, and univariate Cox proportional hazards analyses evaluating the association between the presence or absence of each gene mutation and overall survival. The performance of the trained DeepSurv model was evaluated on the test set of the immunotherapy treated group using the concordance index (C index). The trained model was subsequently applied without retraining to the entire chemotherapy alone treated group as a control. The resulting C indexes for the immunotherapy treated group and chemotherapy alone treated group were 0.789 and 0.483, respectively. All patients within each group were divided into high and low risk groups according to the median predicted risk score. 
Kaplan Meier survival curves of high and low risk groups (n=43 vs n=70) in the immunotherapy treated group revealed a significant separation (log rank p<0.001), whereas no separation was observed in the chemotherapy alone treated group (p=0.62). In the combined cohort of the immunotherapy treated group and chemotherapy alone treated group, the interaction between the DeepSurv derived risk score and treatment modality was significant (HR for interaction 1.47, 95% CI from 1.32 to 1.65, p<0.005), suggesting that the predictive value of the DeepSurv derived risk score is specific to immunotherapy. Principal component analysis and permutation importance analysis were performed as complementary analyses to assess individual genes associated with the DeepSurv derived risk score and identified ZFHX3, SMARCA4, ALK, BTK, and NOTCH2 as major contributors to survival risk stratification. Collectively, we suggest that the nonlinear coupling pattern of 31 tumor gene mutation statuses in the DeepSurv model captures the heterogeneity of survival risk among patients with NSCLC treated with nivolumab containing immunotherapy or nivolumab containing immunotherapy combined with chemotherapy, which was visualized as a clear separation between high risk and low risk groups divided by the median value of the risk scores.
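The two evaluation steps named in this abstract, the concordance index and the median split of risk scores, can be illustrated with a minimal sketch. It assumes Harrell's definition of the C index and assigns scores equal to the median to the low risk group; the abstract does not state either choice, and the toy follow-up times, event flags, and risk scores below are invented for illustration only.

```python
from statistics import median


def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs that the risk score orders correctly."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i has an observed event
            # strictly before subject j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0  # earlier event has the higher risk score
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def median_split(risks):
    """Label each subject 'high' or 'low' relative to the median risk score."""
    cutoff = median(risks)
    # Assumption: scores equal to the median go to the low risk group.
    return ["high" if r > cutoff else "low" for r in risks]


# Invented toy data: follow-up time, event indicator (1 = death), risk score.
times = [5, 8, 12, 20, 25, 30]
events = [1, 1, 0, 1, 0, 0]
risks = [2.1, 1.7, 0.9, 0.2, 0.3, 0.4]

c_index = concordance_index(times, events, risks)
groups = median_split(risks)
print(round(c_index, 3), groups)
```

A C index near 0.5, as reported for the chemotherapy alone group, means the score orders pairs no better than chance.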

3
Integrating multi-omics and multi-context QTL data with GWAS reveals the genetic architecture of complex traits and improves the discovery of risk genes

Qian, S.; Luo, K.; Sun, X.; Crouse, W.; Liang, L.; Gu, J.; Stephens, M.; Zhao, S.; He, X.

2025-12-27 genetic and genomic medicine 10.64898/2025.12.19.25342620
Top 0.4%
31× avg

Recent studies showed that expression QTLs, even from trait-related tissues, explained a small fraction of complex trait heritability. A natural strategy to close this gap is to incorporate molecular QTLs (molQTLs) beyond gene expression, across diverse tissue/cellular contexts. Yet, integrating such QTL data presents analytical challenges. Molecular traits often share QTLs or have QTLs in high LD, complicating the attribution of GWAS signals to specific molecular traits. Our simulations showed that commonly used colocalization and TWAS methods have highly inflated false positive rates in such settings. Building on our earlier work, we developed multi-group causal TWAS (M-cTWAS) for integrating QTLs of different modalities and contexts. M-cTWAS estimates the contribution of each group of molQTLs to the trait heritability and, using this information, identifies the causal molecular traits, informing the modalities and contexts through which genetic variations act on the phenotype. M-cTWAS showed better control of false discoveries than commonly used methods. Using M-cTWAS, we found that QTLs of multiple modalities greatly increased the explained heritability compared to using eQTLs alone, and enabled the discovery of many more risk genes for a range of complex traits. In conclusion, M-cTWAS effectively integrates diverse molecular QTLs with GWAS to enable causal gene discovery.

4
Constructing a Literature-Derived Database for Benchmarking Polygenic Risk Score Construction Methods with Spectral Ranking Inferences

Sebastian, C.; Yu, M.; Jin, J.

2026-03-03 genetic and genomic medicine 10.64898/2026.03.01.26347258
Top 0.4%
30× avg

Polygenic risk scores (PRSs) have emerged as a valuable tool for genetic risk prediction and stratification in human diseases. Over the past decade, extensive methodological efforts have focused on improving the predictive power of PRS, leading to the development of numerous methods for PRS construction. Benchmarking these methods is therefore essential for guiding future PRS applications. While studies have benchmarked subsets of these methods on specific phenotypes and cohorts, the resulting evidence remains fragmented, with a lack of work that comprehensively assesses the relative performance of the various PRS methods. In this study, we addressed this gap by systematically constructing a PRS method benchmarking database synthesizing published results from 2009 to 2025. We applied a spectral ranking inference framework with uncertainty quantification to rank 14 PRS methods that had been adequately compared against each other in the literature. We constructed rankings using two complementary sources: original method-development studies and applications/benchmarking studies. While the highest-ranked methods (LDpred2 and AnnoPred) and the lowest-ranked method (C+T) were consistently identified from both sources, the relative ordering of most methods showed moderate variability. We further constructed phenotype-specific rankings, providing more detailed insights into the robustness and phenotype-specific strengths of individual methods. Collectively, the overall and phenotype-specific rankings of the PRS methods, along with the curated benchmarking data from the literature, provide a dynamic and practical reference database that can be continually updated with emerging PRS methods and published benchmarking results to guide future PRS applications.

5
Predicting Protein Cascade Expression from H&E Images

Leyva, A.; Akbar, A. R.; Niazi, M. K. K.

2026-01-24 pathology 10.64898/2026.01.23.26344725
Top 0.5%
29× avg

Protein expression within oncogenic or suppressive pathways is a hallmark indicator of oncogenesis. While traditional AI models in digital pathology attempt to predict singular proteins, there is a need to predict the downstream expression of proteins to indicate the propagation of signals. RNA expression provides novel information, but does not provide information about the downstream propagation of protein signals or whether those signals are functional. Using Reverse Phase Protein Array (RPPA) data with whole-slide images (WSIs) from the publicly available Cancer Genome Atlas Breast Adenocarcinoma dataset (TCGA-BRCA), we predict the expression of five key proteins identified from the apoptosis cascade, using DNA damage and repair (DDR) cascades as a biological control. Furthermore, we examine the performance of patch-level Vision Transformers (ViT) on the regression task, which was tested against the designed cellular-level ViT, CellRPPA. Our results demonstrate that patch-level vision transformers were unable to obtain statistically significant predictive results, achieving R-squared values < 0.1 for all folds. In addition, CellViT obtained R-squared values > 0.1 in all five test folds. We also show that morphologically indicative cascades, such as the apoptosis cascade, provide significantly higher performance compared to the DDR cascade.

6
An Integrated Deep Learning Framework for Small-Sample Biomedical Data Classification: Explainable Graph Neural Networks with Data Augmentation for RNA sequencing Dataset

Guler, F.; Goksuluk, D.; Xu, M.; Choudhary, G.; agraz, m.

2026-02-24 genetic and genomic medicine 10.64898/2026.02.22.26346827
Top 0.5%
21× avg

Applying deep learning models to RNA-Seq data poses substantial challenges, primarily due to the high dimensionality of the data and the limited sample sizes. To address these issues, this study introduces an advanced deep learning pipeline that integrates feature engineering with data augmentation. The engineering application focuses on biomedical engineering, specifically the classification of RNA-Seq datasets for disease diagnosis. The proposed framework was initially validated on synthetic datasets generated from Naive Bayes, where MLP-based augmentation yielded a notable improvement in predictive performance. Building on this foundation, we applied the approach to chromophobe renal cell carcinoma (KICH) RNA-Seq data from The Cancer Genome Atlas (TCGA). Following standard preprocessing steps (normalization, transformation, and dimensionality reduction), the analysis concentrated on three main aspects: augmentation strategies, preprocessing methods, and explainable AI (XAI) techniques in relation to classification outcomes. Feature selection was performed through PCA, Boruta, and RF-based methods. Three augmentation strategies (linear interpolation, SMOTE, and MixUp) were evaluated. To maintain methodological rigor, augmentation was applied exclusively to the training set, while the test set was held out for unbiased evaluation. Within this framework, we conducted a comparative assessment of multiple deep learning architectures, including MLP, GNN, and the recently proposed Kolmogorov-Arnold networks (KAN). The GNN achieved the highest classification accuracy (99.47%) when trained with MixUp augmentation combined with RF feature selection, and achieved the best F1 score (0.9948). Consequently, the GNN-based XAI framework was applied to the RF dataset enriched with MixUp. 
XAI analyses identified the top 20 most influential genes, such as HNF4A, DACH2, MAPK15, and NAT2, which played the greatest role in classification, thereby confirming the biological plausibility of the model outputs. To further validate model robustness, cervical cancer and Alzheimer's RNA-Seq datasets were also tested, yielding consistent and reliable results. Overall, the findings highlight the value of incorporating data augmentation into deep learning models for RNA-Seq analysis, not only to improve predictive performance but also to enhance biological interpretability through explainable AI approaches.
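As a rough illustration of MixUp applied only to the training set, the sketch below draws a mixing weight from a Beta distribution and forms convex combinations of two training samples and their one-hot labels. The `mixup` helper, the `alpha` value, the seed, and the toy expression values are all assumptions made for illustration; the abstract does not describe the authors' implementation.

```python
import random


def mixup(features, labels, n_new, alpha=0.4, seed=0):
    """Create n_new synthetic samples as convex combinations of training pairs.

    features: list of feature vectors; labels: matching one-hot label vectors.
    alpha is the Beta(alpha, alpha) shape parameter (an illustrative value).
    """
    rng = random.Random(seed)
    new_x, new_y = [], []
    for _ in range(n_new):
        i = rng.randrange(len(features))
        j = rng.randrange(len(features))
        lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
        new_x.append([lam * a + (1 - lam) * b
                      for a, b in zip(features[i], features[j])])
        new_y.append([lam * a + (1 - lam) * b
                      for a, b in zip(labels[i], labels[j])])
    return new_x, new_y


# Invented toy data: three training samples with two features each.
train_x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
train_y = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
test_x = [[2.5, 25.0]]  # held out; never passed to mixup

aug_x, aug_y = mixup(train_x, train_y, n_new=4)
train_x_aug = train_x + aug_x
train_y_aug = train_y + aug_y
print(len(train_x_aug), len(test_x))
```

Keeping `test_x` out of the call mirrors the abstract's point that synthetic samples must not leak into the evaluation set.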

7
Retrospective evaluation of human genetic evidence for clinical trial success using Mendelian randomization and machine learning

Ravarani, C. N. J.; Arend, M.; Baukmann, H. A.; Cope, J. L.; Lamparter, M. R. J.; Sullivan, J. K.; Fudim, R.; Bender, A.; Malarstig, A.; Schmidt, M. F.

2026-02-23 pharmacology and therapeutics 10.64898/2026.02.19.26346536
Top 0.6%
20× avg

Human genetics has become a cornerstone of drug target discovery, yet the value of Mendelian randomization (MR) for predicting clinical success remains uncertain. Here, we systematically evaluated MR across 11,482 target-indication pairs with documented Phase II clinical outcomes to assess its utility for drug development. We find that MR statistical significance alone does not enrich for Phase II success, in contrast to genome-wide association study (GWAS) support, which confers an increase in success probability. However, this apparent limitation reflects the heterogeneous nature of clinical failure and the fact that MR encodes information beyond P values. When MR-derived features, including instrument strength and explained variance, are integrated into machine learning models, predictive performance improves substantially. An MR-informed XGBoost classifier identifies target-indication pairs with a 55% overall approval rate, corresponding to a 6.4-fold enrichment over unstratified programs and a 2.8-fold improvement over GWAS-supported targets in Phase II. Notably, this enrichment is achieved without reliance on statistically significant MR results. Our findings demonstrate that MR is most informative when treated as a graded, context-dependent source of causal evidence rather than a binary hypothesis test, and that its integration with machine learning enables scalable, genetics-informed prioritization of drug targets across the clinical pipeline.

8
Assessing the Role of Model Complexity in Virtual Clinical Trial Outcomes

Gevertz, J. L.; Wares, J. R.

2025-12-27 pharmacology and therapeutics 10.64898/2025.12.22.25342808
Top 0.6%
20× avg

Virtual clinical trials (VCTs) hold significant promise for improving the drug development process, yet their predictive reliability depends critically on design decisions that remain poorly understood. This study examines how model complexity influences VCT outcomes, as well as how the choice of prior parameter distributions and virtual patient inclusion criteria affects those outcomes. Using oncolytic virotherapy treatment of murine tumors as a case study, we compared three mathematical models of varying complexity under different parameter priors (uniform and normal distributions) and two inclusion methods (accept-or-reject and accept-or-perturb). Our results demonstrate that the simplest model produces a plausible population that inadequately spans the feasible trajectory space, potentially missing critical interpatient heterogeneity. However, we found diminishing returns beyond intermediate model complexity, as both the intermediate and complex models captured similar ranges of patient responses across dosing protocols. Notably, the accept-or-reject method generated posterior parameter distributions that resembled the chosen priors, possibly overly reducing interpatient variability in treatment responses, particularly at high doses. In contrast, the accept-or-perturb inclusion criteria produced more robust results that were less sensitive to prior assumptions. These findings suggest that VCT design should prioritize models with sufficient biological detail to capture key mechanisms without unnecessary complexity, paired with inclusion criteria that avoid over-constraining plausible populations to match potentially unrealistic prior assumptions.

9
A Novel Open Access Multimodal Dataset Of Nodule Imaging And Circulating Proteome From A Lung Cancer Screening Cohort

Cobo, M.; Serrano, D.; Barranco, J.; Pasquier, A.; de-Torres, J. P.; Zulueta, J. J.; Echeveste, J. I.; Ezponda, A.; Argueta, A.; Sanz-Ortega, J.; Berto, J.; Alcaide, A. B.; di Frisco, M.; Felgueroso, C.; Campo, A.; de la Fuente, A. A.; Escobar, A.; Valencia, K.; Orive, D.; Ocon, M. d. M.; Globacka, H. B.; Fortuno, M. A.; Perna, V.; Rodriguez, M.; Lozano, M. D.; Calvo, A.; Pio, R.; Hung, R. J.; Seijo, L. M.; Silva, W.; Bastarrika, G.; Lloret Iglesias, L.; Montuenga, L. M.

2025-12-27 oncology 10.64898/2025.12.23.25342921
Top 0.6%
20× avg

Introduction: Low-dose computed tomography (LDCT) lung cancer screening has significantly enhanced early detection and patient survival rates in the population at risk. Current screening methods, which rely primarily on LDCT imaging, will very likely benefit from molecular biomarkers to achieve a more comprehensive, accurate, personalized and non-invasive risk assessment leveraging multimodal tools. We present a novel open access multimodal (imaging, proteomics and demographic) dataset designed to provide an available research resource on LDCT-based early lung cancer detection. The dataset includes annotated screening LDCT scans and plasma proteomics generated by the proximity extension assay (Olink) platform. Methods: The dataset integrates data from control screened individuals without nodules or with benign nodules, and LDCT-diagnosed lung cancer individuals, matched by sex, age and time between image and sample collection. Both radiological and molecular signatures were collected within a six month window, providing detailed insights into disease progression. Nodules were considered as lung cancer cases if biopsy-confirmed lung cancer was diagnosed within 5 years after imaging, enabling the study of longitudinal biomarker evolution and its correlation with imaging findings. To complement the dataset, clinical and demographic data are also available in open access, providing a detailed overview of patient characteristics. The informed consent signed by the participants allows for unrestricted open access for requests directly or indirectly related to lung cancer research. Results: The dataset consists of annotated screening LDCT scans and plasma proteomics data measured with most of the Olink Target 96 platforms (1078 individual proteins across 12 panels focused on a specific area of disease or biology) for a total of 211 screening participants. 
There are 67 lung cancer patients, 68 matched controls with benign pulmonary nodules, 71 matched controls without nodules and 5 surgically excised false positive lesions. Experiments were performed to assess the technical quality and provide a proof-of-concept of usability of the dataset, showing the alignment with findings from previous published studies. Conclusion: This comprehensive dataset aims to facilitate research towards the development of personalized multimodal artificial intelligence models. We also aim to support the investigation of the relationship between imaging and molecular data, paving the way for more accurate understanding of early lung cancer biology. Finally, our open access dataset may help to develop or validate individualized risk prediction models that could significantly advance early lung cancer detection and intervention strategies.

10
JointMR: A joint likelihood-based approach for causal effect estimation in overlapping Mendelian Randomization studies

Wu, S.; Hou, L.; Yuan, Z.; Sun, X.; Yu, Y.; Chen, H.; Huang, L.; Li, H.; Xue, F.

2025-12-19 genetic and genomic medicine 10.64898/2025.12.18.25342634
Top 0.6%
20× avg

The integration of causal effect estimates from multiple Mendelian Randomization studies has become increasingly popular. However, the presence of overlapping databases compromises traditional meta-analysis, leading to inflated variance and reduced statistical power. Here, we propose JointMR, a joint likelihood-based approach designed to integrate multiple GWAS summary databases while explicitly accounting for the covariance matrix of the Wald ratio estimates. Specifically, to accommodate potential cross-study heterogeneity, JointMR incorporates both fixed-effect and random-effects models. Simulations demonstrated that JointMR provides unbiased estimates with higher statistical power and superior Type I error control compared to conventional meta-analysis methods of standard MR estimates (e.g., IVW), especially as database correlation increases. In a real-data application examining total cholesterol, HDL-C, LDL-C and triglycerides on type 2 diabetes, JointMR resolved contradictions seen in standard approaches, generating stable and biologically plausible estimates. In conclusion, JointMR overcomes critical limitations of existing methods, offering a more powerful and reliable tool for robust causal inference from the growing repository of GWAS summary statistics.
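For orientation, the conventional baseline that this abstract compares against can be sketched as follows: a Wald ratio per instrument, then a fixed-effect inverse-variance weighted (IVW) combination that treats the estimates as independent. This is not JointMR, which instead models the covariance between estimates; the function names and the summary statistics below are illustrative assumptions, and the standard error uses the common first-order approximation.

```python
import math


def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Wald ratio estimate with a first-order (delta method) standard error."""
    estimate = beta_outcome / beta_exposure
    se = se_outcome / abs(beta_exposure)
    return estimate, se


def ivw_fixed_effect(estimates, ses):
    """Inverse-variance weighted mean, assuming independent estimates."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se


# Invented summary statistics: (SNP-exposure beta, SNP-outcome beta, outcome SE).
snps = [(0.10, 0.020, 0.005), (0.20, 0.050, 0.010), (0.05, 0.012, 0.004)]

ratios = [wald_ratio(bx, by, sy) for bx, by, sy in snps]
pooled, pooled_se = ivw_fixed_effect([r[0] for r in ratios],
                                     [r[1] for r in ratios])
print(round(pooled, 4), round(pooled_se, 4))
```

The independence assumption inside `ivw_fixed_effect` is the part that overlapping databases violate, which is the problem the abstract sets out to address.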

11
Genome-Wide Significance Reconsidered: Low-Frequency Variants and Regulatory Networks in Autism

Mendes de Aquino, M.; Engchuan, W.; Thompson, S.; Zhou, X.; Safarian, N.; Chen, D. Z.; Trost, B.; Salazar, N. B.; Ma, C.; Thiruvahindrapuram, B.; Vorstman, J.; Scherer, S. W.; Breetvelt, E.

2026-02-12 genetic and genomic medicine 10.64898/2026.02.11.26346090
Top 0.7%
19× avg

Low-frequency variants (LFVs), defined by minor allele frequencies (MAF) of 1-5%, occupy the gap between common and rare variants in both frequency and effect size. The conventional genome-wide association study (GWAS) significance threshold (5×10⁻⁸) is overly conservative for LFVs, which account for more than 25% of variants in GWAS. This limitation may obscure meaningful associations in highly heritable yet genetically complex disorders such as autism spectrum disorder (ASD). We hypothesize that the scarcity of significant LFVs in ASD GWAS reflects statistical constraints rather than a true lack of association. To address this, we derived a MAF-specific genome-wide significance threshold using linkage disequilibrium-informed simulations applied to ASD GWAS summary statistics, identifying 2.03x10- as optimal. Applying this threshold revealed three novel LFVs mapping to zinc finger proteins (ZNF420, ZNF781) and known ASD-related genes (KMT2E, PRKDC, MCM4). Enrichment analyses suggested their function in nervous system development and gene regulation. Our findings highlight the contribution of LFVs to ASD risk and underscore the importance of frequency-aware association strategies.
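The conventional genome-wide threshold is commonly motivated as a Bonferroni correction of a 0.05 error rate over roughly one million independent tests. The sketch below reproduces only that arithmetic. The abstract's own threshold comes from linkage disequilibrium-informed simulations, which are not reproduced here, and the effective test count used for the frequency-specific example is a hypothetical number, not a figure from the study.

```python
import math


def bonferroni_threshold(alpha, n_independent_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    return alpha / n_independent_tests


# Conventional threshold: 0.05 spread over about one million independent tests.
conventional = bonferroni_threshold(0.05, 1_000_000)

# Hypothetical: a variant class with fewer effective tests gets a less strict cutoff.
hypothetical_effective_tests = 400_000  # invented for illustration
class_specific = bonferroni_threshold(0.05, hypothetical_effective_tests)

print(conventional, class_specific)
```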

12
Benchmarking HLA genotyping from whole-genome sequencing across multiple sequencing technologies

Cremin, C.; Elavalli, S.; Paulin, L.; Arres Reche, J.; Saad, A. A. Y. A.; Attia, A.; Minas, C.; Aldhuhoori, F.; Katagi, G.; Wu, H.; Sidahmed, H.; Mafofo, J.; Soliman, O.; Behl, S.; Pariyachery, S.; Gupta, V.; Ghanem, D.; Sajjad, H.; Cardoso, T.; El-Khani, A.; Al Marzooqi, F.; Magalhaes, T.; Sedlazeck, F. J.; Quilez, J.

2026-02-12 health informatics 10.64898/2026.02.10.26345621
Top 0.7%
19× avg

Background: The hyperpolymorphic nature and structural complexity of the human leukocyte antigen (HLA) genomic region present challenges for accurate and scalable typing across diverse sample types. While whole-genome sequencing (WGS) offers the opportunity to infer HLA genotypes without targeted enrichment, systematic benchmarks across sequencing platforms, biospecimens and coverage levels remain limited. Results: We assembled a multi-platform resource of WGS datasets derived from short-read (Illumina, MGI) and long-read (Oxford Nanopore Technologies R9 and R10) sequencing, spanning 29 biospecimens including cell lines, blood, buccal swab and saliva. We evaluated the performance of the HLA caller HLA*LA across 13 HLA genes, using a clinically validated assay as reference. WGS-based HLA genotyping achieved ~95% accuracy across sequencing platforms, with Class I loci exhibiting higher accuracy than Class II. Cross-platform concordance was high, and performance remained consistent across Illumina, MGI and Oxford Nanopore chemistries. Analysis of blood, buccal swab and saliva samples showed that blood and buccal swabs supported accurate HLA inference, whereas saliva yielded reduced concordance. Downsampling experiments demonstrated that 15x coverage was sufficient to retain >95% accuracy at two-field resolution, with lower depths supporting lower-resolution typing. Conclusions: Our results demonstrate that WGS provides a robust, platform-agnostic framework for accurate HLA genotyping across sample types and coverage levels. These benchmarks establish practical conditions for reliable HLA inference and underscore the utility of WGS for population-scale HLA analyses and future clinical applications.

13
Improving Quality of CAR-T Cell Therapy Starting Material with Automated Microfluidic Cell Sorting

Skelley, A.; Behmardi, Y.; Petersen, L.; Shehada, M.; Ouaguia, L.; Gandhi, K.; Campos-Gonzalez, R.; Ward, T.

2025-12-18 pharmacology and therapeutics 10.64898/2025.12.16.25342401
Top 0.7%
18× avg

Autologous CAR-T cell therapy has demonstrated remarkable clinical efficacy in hematologic malignancies, yet its broader application remains limited by complex, labor-intensive manufacturing and inconsistent product quality. We describe a novel microfluidic cell separation platform based on Deterministic Lateral Displacement (DLD), integrated into a fully automated, closed-system instrument (Curate System), capable of processing full leukopacks in under one hour. Compared to Ficoll®-based density gradient centrifugation, DLD processing yielded significantly higher leukocyte recovery (88% vs. 58%), superior platelet and red blood cell depletion, and reduced CD69 T-cell activation. Flow cytometric analysis revealed improved phenotypic preservation across key T-cell subsets, including naive and central memory populations. Cytokine profiling demonstrated enhanced washing efficiency, with markedly lower levels of biologic response modifiers such as RANTES and TGF-β1. DLD-purified T cells exhibited enhanced expansion kinetics and greater yield, supporting improved manufacturing outcomes. These findings position DLD-based processing as a clinically relevant, scalable alternative to conventional methods, with potential to improve consistency, potency, and accessibility of CAR-T therapies.

14
Federated penalized piecewise exponential model for horizontally distributed survival data: FedPPEM

Islam, N.; Luo, C.; Tong, J.; Polleya, D. A.; Jordan, C. T.; Haverkos, B.; Bair, S.; Kent, A.; Weller, G.

2026-02-12 health informatics 10.64898/2026.02.11.26346054
Top 0.8%
17× avg

Cox proportional hazards regressions are frequently employed to develop prognostic models for time-to-event data, considering both patient-specific and disease-specific characteristics. In high-dimensional clinical modeling, these biological features can exhibit high collinearity due to inter-feature relationships, potentially causing instability and numerical issues during estimation without regularization. For rare diseases such as acute myeloid leukemia (AML), the sparsity and scarcity of data further complicate estimation. In such cases, data augmentation through multi-site collaboration can alleviate these problems. However, this often necessitates sharing individual patient data (IPD) across sites, which presents challenges due to regulatory barriers aimed at protecting patient privacy. To overcome these challenges, we propose a privacy-preserving algorithm that eliminates sharing IPD across sites and fits a federated penalized piecewise exponential model (FedPPEM) to estimate potential effects of clinical features using summary statistics. This algorithm yields results nearly identical to those from pooled IPD, including effect size and standard error estimates. We demonstrate the model's performance in quantifying effects of clinical features and genetic risk classification on overall survival using real-world data from ~1200 newly diagnosed AML patients across 33 U.S. sites. Although applied in the AML context, this model is disease-agnostic and can be implemented in other diseases and clinical contexts.
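A simplified sketch can show why summary statistics are enough for a piecewise exponential model. It covers only the covariate-free case, where the maximum likelihood hazard in each interval is the event count divided by the person-time at risk, so sites need to share just those two totals per interval. The penalized regression with clinical features that FedPPEM fits is not reproduced here, and the helper names, cut points, and patient records are invented for illustration.

```python
def interval_summaries(patients, cut_points):
    """Per-interval event counts and person-time for one site.

    patients: list of (follow_up_time, event) with event 1 = death, 0 = censored.
    cut_points: sorted interval start times beginning at 0; the last interval is open.
    """
    n = len(cut_points)
    events = [0] * n
    exposure = [0.0] * n
    for time, event in patients:
        for k, start in enumerate(cut_points):
            end = cut_points[k + 1] if k + 1 < n else float("inf")
            if time <= start:
                break  # the patient left the risk set before this interval
            exposure[k] += min(time, end) - start
            if event == 1 and time <= end:
                events[k] += 1  # the event falls in (start, end]
    return events, exposure


def pooled_hazards(site_summaries):
    """Sum the shared totals across sites and return events / person-time."""
    n = len(site_summaries[0][0])
    total_events = [sum(s[0][k] for s in site_summaries) for k in range(n)]
    total_exposure = [sum(s[1][k] for s in site_summaries) for k in range(n)]
    return [d / t for d, t in zip(total_events, total_exposure)]


cuts = [0.0, 3.0, 6.0]
site_a = [(2.0, 1), (5.0, 0), (7.5, 1)]  # invented records
site_b = [(1.0, 1), (6.0, 1), (9.0, 0)]

federated = pooled_hazards([interval_summaries(site_a, cuts),
                            interval_summaries(site_b, cuts)])
centralized = pooled_hazards([interval_summaries(site_a + site_b, cuts)])
print(federated, centralized)
```

In this reduced setting the federated and pooled estimates agree exactly, since the likelihood depends on the data only through the shared totals.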

15
Physiology-Informed Conditional Variational Autoencoder for Generating Pediatric Virtual Patients

Irie, K.; Mizuno, T.

2026-01-24 pharmacology and therapeutics 10.64898/2026.01.21.26344442
Top 0.8%
17× avg

Reliable pediatric virtual patients are essential for model-informed simulations, including physiologically based pharmacokinetic (PBPK) modeling, to support dose selections in children and to evaluate drug exposure across developmental stages. Despite the availability of extensive pediatric physiological data and age- or size-based models, there remains a lack of well-established, flexible, and scalable approaches for integrating these data into realistic pediatric virtual patients that preserve multivariate physiological correlations and whole-body coherence across diverse clinical conditions and population needs. In this proof-of-concept study, we developed a physiology-informed conditional variational autoencoder (cVAE) to address this challenge. The model was trained using real-world pediatric data augmented with mechanistically derived physiological information and conditioned on age and sex. It generated realistic physiological parameters, including body size, estimated glomerular filtration rate, organ weights, and blood flows, while biological plausibility was maintained through embedded physiological constraints. The trained model demonstrated high reconstruction accuracy, with a mean absolute error of 0.0043 and an R² of 0.998, and the generated distributions closely matched those of the training data. All synthesized physiological profiles satisfied predefined physiological constraints, with total organ mass remaining below body weight and the sum of organ blood flows not exceeding cardiac output. Latent-space analyses further revealed smooth developmental patterns, enabling targeted physiological profile generation. The applicability of the generated physiological data was further demonstrated through PBPK simulations conducted across the pediatric age range using vancomycin as a testbed. 
Overall, this physiology-informed generative framework supports coherent pediatric virtual patient generation for PBPK modeling and model-informed dosing applications development. Study Highlights: What is the current knowledge on the topic? Model-informed simulation, including physiologically based pharmacokinetic (PBPK) modeling, provides useful estimation of pediatric drug disposition across developmental stages and supports pediatric dose selection. However, constructing physiologically coherent pediatric virtual populations remains challenging. Although real-world pediatric measurements and physiologically derived, function-based information are available, these data are typically obtained from heterogeneous sources. Integrating them to generate multivariate physiological profiles at the individual level, while maintaining internal coherence across interconnected organ systems, remains an open challenge in pediatric pharmacometric modeling. What question did this study address? Can a conditional generative modeling and latent representation learning framework that integrates real-world pediatric data with mechanistically derived physiological constraints generate biologically coherent, multivariate pediatric physiological profiles that are suitable for downstream PBPK modeling across the full pediatric age range? What does this study add to our knowledge? This study introduces a physiology-informed conditional variational autoencoder that learns a smooth, interpretable latent space of pediatric physiology conditioned on age and sex. By embedding physiological constraints directly into the training objective, the model generates virtual pediatric patients with internally consistent body size, renal function, organ weights, and blood flows. The utility of these generated profiles was demonstrated through latent-space inversion and vancomycin PBPK simulations that reproduced reported age-dependent exposure trends and variability. 
How might this change clinical pharmacology or translational science?This framework provides a scalable approach for generating physiologically coherent pediatric virtual populations, laying a foundation for robust and flexible mechanistic PK simulations, virtual clinical trials, and digital twin applications. It offers a practical bridge between real-world pediatric data and mechanistic modeling, supporting model-informed dosing and translational decision-making in pediatric patient care and drug development.
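The two whole-body constraints named in this abstract (total organ mass below body weight, summed organ blood flows not exceeding cardiac output) can be illustrated with a minimal Python sketch. This is only a post-hoc check on a generated profile, not the authors' training objective; the class name, field names, units, and example values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class VirtualPatient:
    """One generated physiological profile (all field names are hypothetical)."""
    body_weight_kg: float
    cardiac_output_l_per_h: float
    organ_weights_kg: Dict[str, float]
    organ_blood_flows_l_per_h: Dict[str, float]


def satisfies_constraints(p: VirtualPatient) -> bool:
    """Return True when both whole-body constraints from the abstract hold."""
    total_organ_mass = sum(p.organ_weights_kg.values())
    total_blood_flow = sum(p.organ_blood_flows_l_per_h.values())
    # Constraint 1: total organ mass must remain below body weight.
    mass_ok = total_organ_mass < p.body_weight_kg
    # Constraint 2: summed organ blood flows must not exceed cardiac output.
    flow_ok = total_blood_flow <= p.cardiac_output_l_per_h
    return mass_ok and flow_ok


# Made-up illustrative numbers, not taken from the study.
child = VirtualPatient(
    body_weight_kg=20.0,
    cardiac_output_l_per_h=200.0,
    organ_weights_kg={"liver": 0.6, "kidneys": 0.1, "brain": 1.3},
    organ_blood_flows_l_per_h={"liver": 40.0, "kidneys": 30.0, "brain": 35.0},
)
print(satisfies_constraints(child))
```

A check of this kind only filters profiles after generation; the abstract states that the constraints were embedded in the training objective, which is a different and stronger mechanism.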

16
Wakhan: reconstruction of chromosome-scale copy number profiles of tumor genomes with long-read sequencing

Ahmad, T.; Keskus, A. G.; Aganezov, S.; Goretsky, A.; Rodriguez, I.; Yoo, B.; Lansdon, L. A.; Repnikova, E. A.; Zhang, L.; Liu, Y.; Donmez, A.; Bryant, A.; Tulsyan, S.; Park, J.; Gardner, J.; McNulty, B.; Sacco, S.; Shetty, J.; Zhao, Y.; Tran, B.; Malikic, S.; Day, C.-P.; Miga, K.; Paten, B.; Sahinalp, C.; Farooqi, M. S.; Dean, M.; Kolmogorov, M.

2025-12-15 genetic and genomic medicine 10.64898/2025.12.11.25342098
Top 0.8%
17× avg

A common signature of cancer genomes is a complex, rearranged karyotype, characterized by acquired gains or losses of chromosomal material, referred to as somatic copy number alterations (CNAs). Identification of haplotype-specific CNAs from bulk sequencing data is a key step in many short-read cancer genomic workflows; however, short reads have a limited phasing range. In contrast, long reads can directly phase genomic variants into contiguous haplotypes. Here, we present Wakhan, a long-read method for haplotype-specific CNA calling that can reconstruct longer CNA profiles, up to chromosome scale, of rearranged cancer genomes. Using multi-technology sequencing of a cell line panel, combined with high-quality de novo assemblies, we show that Wakhan CNA profiles are more consistent with sequencing data than those of other popular short- and long-read CNA callers. Further, we show that in combination with accurate somatic SV calls, Wakhan CNA profiles provide additional insights into mutational processes in various breast cancer genomes. Finally, we apply Wakhan to multiple pediatric cancer samples and illustrate the high consistency with standard clinical genetic testing.

17
Palette polygenic risk score framework improves risk prediction by capturing clinical heterogeneity of type 2 diabetes

Miyake, A.; Tanabe, H.; Narita, A.; Ojima, T.; Kyosaka, T.; Gocho, C.; Sakurai, R.; Takayama, J.; Yamakage, H.; Tanaka, K.; Kazama, J. J.; Satoh-Asahara, N.; Shimabukuro, M.; Tamiya, G.

2026-01-13 genetic and genomic medicine 10.64898/2026.01.12.25342123
Top 0.9%
17× avg

Polygenic risk scores (PRSs) are typically constructed under the assumption of a single, homogeneous disease phenotype. However, many common diseases exhibit considerable clinical heterogeneity and encompass multiple subtypes with distinct etiologies and clinical characteristics. As a result, conventional PRSs often overlook differences in underlying biological pathways among disease subtypes, limiting predictive accuracy and cross-ancestry transferability. To address this challenge, we propose the "palette PRS," a framework that integrates a set of partitioned polygenic scores (pPSs) for biologically interpretable pathways with subtype-specific weights. This approach can flexibly capture the relative contributions of multiple pathways within each individual and provides a unified risk score. We applied this framework to type 2 diabetes (T2D), a clinically highly heterogeneous disease. For T2D, previous machine learning-based studies have identified four distinct subtypes and 12 biologically interpretable pathways derived from 650 genome-wide significant variants. Building on these established findings, we employed an elastic net model incorporating subtype membership probabilities to derive a subtype-optimized palette PRS through the weighted integration of the pPSs of these 12 pathways. Our palette PRS showed superior predictive performance, with particularly high accuracy for the severe insulin-deficient diabetes (SIDD) subtype (AUC = 0.744), compared with both the conventional T2D PRS (AUC = 0.661) and the subtype-stratified GWAS-based PRS (AUC = 0.547). Moreover, our palette PRS exhibited substantial cross-ancestry transferability between East Asian and European populations. This strategy represents a major step toward clinically actionable, subtype-optimized risk prediction and personalized prevention in T2D worldwide.
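As a rough illustration of the "weighted integration" idea in this abstract, the Python sketch below combines pathway-level scores using subtype-specific weights, with each subtype's contribution scaled by its membership probability. The functional form is an assumption made for illustration; the abstract does not give the exact formula, and in the study the weights come from an elastic net fit, which is not reproduced here. All names and numbers are hypothetical.

```python
from typing import Dict, List


def palette_prs(
    pathway_scores: List[float],
    subtype_probs: Dict[str, float],
    subtype_weights: Dict[str, List[float]],
) -> float:
    """Combine partitioned pathway scores (pPSs) with subtype-specific weights.

    Assumed form (not confirmed by the abstract):
        score = sum over subtypes s of P(s) * sum over pathways k of w[s][k] * pPS[k]
    """
    score = 0.0
    for subtype, prob in subtype_probs.items():
        weights = subtype_weights[subtype]
        if len(weights) != len(pathway_scores):
            raise ValueError("one weight per pathway is required")
        # Pathway scores weighted for this subtype, scaled by membership probability.
        score += prob * sum(w * s for w, s in zip(weights, pathway_scores))
    return score


# Toy example with two pathways and two subtypes (values are illustrative only).
example = palette_prs(
    pathway_scores=[1.0, 2.0],
    subtype_probs={"A": 0.5, "B": 0.5},
    subtype_weights={"A": [1.0, 0.0], "B": [0.0, 1.0]},
)
print(example)
```

In the toy example each subtype draws on a single pathway, so the result is simply the probability-weighted average of the two pathway scores.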

18
PHARMWATCH: A Multilayer Pharmacogenomics Safety System for Accurate Star Allele Interpretation

Eisenhart, C. E.; Brickey, R.; Mewton, J.

2026-02-28 genetic and genomic medicine 10.64898/2026.02.26.26347200
Top 1%
12× avg

The Clinical Pharmacogenetics Implementation Consortium (CPIC) bases its drug-gene recommendations on the assignment of star alleles, which map known genotypes to defined functional categories and corresponding drug dosage guidelines. The star allele framework, first proposed in 1996 for the CYP gene family and later formalized with CPIC's establishment in 2010 [1, 2], remains foundational to pharmacogenomics. However, this system has notable limitations. Its dependence on a restricted set of benchmark single nucleotide polymorphisms (SNPs) excludes rare or novel pathogenic variants that can invalidate a star allele call and lead to incorrect dosing recommendations. Furthermore, nearby non-pathogenic variants can interfere with haplotype interpretation, introducing additional risk of misclassification. To address these gaps, we developed PHARMWATCH, a multistep pharmacogenomics workflow for comprehensive variant analysis, allele tracking, and contextual interpretation. PHARMWATCH incorporates two algorithmic safeguards designed to identify genomic alterations that compromise star allele accuracy: (1) de novo germline variant screening using the ACMG-based BIAS-2015 classifier and (2) variant interpretation in context (VIIC) to validate the functional integrity of star allele-defining SNPs [3]. Together, these layers enhance the reliability of pharmacogenomic reporting, enabling safe, automated, and review-ready recommendations that extend beyond the constraints of traditional star allele-based approaches.

19
GPAS: an online AI system for rapid and accurate pathogen identification and LLM-based interpretation

Li, T.; Hong, H.; Fan, D.; Li, J.; Li, T.; Wu, J.; Jiang, S.; Xie, X.; Zhang, Y.; Hu, M.; Yin, X.; Zhang, Y.; Ma, H.; Liu, Z.; Su, Z.; Yu, X.; Liu, Y.; Yuan, H.; Zheng, W.; Liu, H.; Ma, M.; Li, X.; Shen, Y.; Zhang, C.; Wang, Y.; Zhao, B.; Sun, L.; Han, Q.-Y.; Chen, J.; Zhang, K.; Chen, L.; Wang, N.; Li, W.; Man, J.; He, K.; Dong, F.; Du, F.; Yi, Y.; Li, A.; Zhou, T.; Zhang, X.; Li, T.

2026-02-20 public and global health 10.64898/2026.02.18.26346517
Top 1%
11× avg

Accurate identification of unknown pathogens is critical for medicine and public health, yet current metagenomic workflows remain heavily dependent on specialized bioinformatics expertise and manual interpretation, creating substantial bottlenecks in time-sensitive diagnostic settings [1]. The key challenges lie in achieving precise species identification amidst high background noise and translating complex microbial data into clinically actionable insights [2, 3]. Here we present the Global Pathogen Analysis System (GPAS), an integrated computational framework that combines rapid and accurate pathogen identification with large language model (LLM)-based semantic interpretation. Central to GPAS is a dynamic-library alignment mechanism informed by prior probabilities of inter-species misclassification. By integrating a hybrid machine learning model that couples elastic neural networks with Bayesian inference, this approach substantially reduces both false positives and false negatives, achieving species-level accuracy superior to existing state-of-the-art tools. To enable clinical interpretation, we constructed a unified microbial knowledge graph integrating global metagenomic and metaviromic sample repositories, and trained a pathogen-specialized LLM agent. Through end-to-end reinforcement learning, the agent autonomously executes multi-step reasoning workflows extracting pathogen-specific insights from complex data and generating human-readable, evidence-based reports. Application to throat swab samples demonstrates that GPAS not only accurately identifies pathogenic microorganisms but also reveals how SLE-associated immune dysregulation reshapes the respiratory microbiome and promotes pathobiont overgrowth, providing clinically instructive interpretations. By substantially lowering technical barriers to pathogen identification, GPAS offers an accessible yet powerful platform for clinical diagnostics, public health surveillance, and microbiome research. 
The system is freely available at: https://gpas.nh.ac.cn/.

20
Detection of Malaria Infection from parasite-free blood smears

Bourriez, N.; Mahanta, S. K.; Svatko, I.; Lacassagne, E.; Atchade, A.; Leonardi, F.; Massougbodji, A.; Cohen, E.; Argy, N.; Cottrell, G.; Genovesio, A.

2026-01-05 health informatics 10.64898/2025.12.29.25343125
Top 1%
11× avg

Malaria affects almost 263 million people worldwide, most of whom live in sub-Saharan countries. In a strategy to reduce malaria-related mortality and limit transmission, diagnosis in endemic areas needs to be immediately available in the field, easy to perform, and cheap. It therefore currently relies heavily on microscopic examination of blood smears. However, several studies comparing the sensitivity of this approach with qPCR, considered the most sensitive method albeit not available in the field, found that up to half of the infected population failed to be detected by microscopy alone because no visible parasites could be found in blood smears. These so-called submicroscopic infections pose a diagnostic challenge, yet represent a huge reservoir for malaria transmission. In this study, we hypothesized that qPCR results could be predicted by deep learning from subtle cell signals present in thin blood smear images, even in the absence of visible parasites, making a sensitive diagnostic directly available in the field using a microscope and a smartphone. To test this hypothesis, we acquired a large dataset of smartphone-based blood smear images from samples tested by both microscopy and qPCR. We then focused exclusively on the slides that were "negative" from the microscopic diagnostic point of view, among which half were qPCR positive. A range of standard deep learning models were evaluated to best predict the qPCR result from these microscopy images, using various backbones along with various aggregation functions at the slide level, from a simple vote to Multiple Instance Learning with attention. Our results show that the qPCR results can be predicted from parasite-free blood smear images with 62.00% (±2.5 across 4 folds) accuracy and 67.2% (±9.6 across 4 folds) sensitivity. 
We then used generative models to investigate the subtle morphological variations occurring in red blood cells that may contribute to predicting malaria infection in the absence of parasites. Leveraging thin blood smears and portable deep learning, we established the first proof of concept that qPCR sensitivity can be approached through the detection of submicroscopic infections directly in the field without additional infrastructure, which could significantly improve malaria surveillance and elimination efforts.
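The abstract mentions slide-level aggregation functions ranging "from a simple vote" to attention-based Multiple Instance Learning. The Python sketch below shows only the simplest end of that range, assuming each slide yields a list of per-patch positive probabilities. The function names, the 0.5 threshold, and the mean-pooling variant are illustrative assumptions, not details from the study.

```python
from typing import List


def majority_vote(patch_probs: List[float], threshold: float = 0.5) -> bool:
    """Call the slide positive when more than half of its patches are positive."""
    if not patch_probs:
        raise ValueError("a slide needs at least one patch prediction")
    positives = sum(1 for p in patch_probs if p >= threshold)
    return positives * 2 > len(patch_probs)


def mean_pool(patch_probs: List[float], threshold: float = 0.5) -> bool:
    """Call the slide positive when the average patch probability reaches the threshold."""
    if not patch_probs:
        raise ValueError("a slide needs at least one patch prediction")
    return sum(patch_probs) / len(patch_probs) >= threshold


# Made-up patch probabilities for one slide: one confident patch, two weak ones.
slide = [0.95, 0.4, 0.4]
print(majority_vote(slide), mean_pool(slide))
```

The two rules can disagree on the same slide: a single highly confident patch can lift the mean above the threshold while losing the vote, which is one reason learned, attention-weighted aggregation is a natural next step.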